Senior Site Reliability Engineer Lead Job Details

India

Job Description

Hyderabad, Telangana

Job Summary

Position: Senior Observability Engineer (GCP, K8s, Dynatrace)

Position Overview:

We are seeking an experienced Senior Observability Engineer to spearhead our monitoring strategy within the Google Cloud Platform (GCP). This role is critical for ensuring the performance, reliability, and health of our entire cloud infrastructure. The ideal candidate will be an expert in implementing and managing Dynatrace in a GCP environment to create a world-class, end-to-end observability practice.

Key Responsibilities:

Architect and Implement Dynatrace on GCP: Lead the deployment, configuration, and lifecycle management of the Dynatrace platform within our GCP ecosystem, including performing installations and managing upgrades.
Establish End-to-End Observability: Drive the strategy to achieve comprehensive, end-to-end visibility across our applications and infrastructure using Dynatrace. This includes creating mirrored dashboards and key metrics within GCP's native Cloud Monitoring service for consolidated reporting.
Develop Executive Dashboards: Design and maintain high-level executive dashboards that provide a single-pane-of-glass view of overall infrastructure health, service-level objectives (SLOs), and key performance indicators (KPIs).

The detailed expectations for building GCP observability are listed below:

GCP Account & Quotas

Quota Limit Utilization: Usage vs. limits for critical resources (e.g., CPUs, IP addresses).
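
As a rough illustration of the usage-versus-limit comparison described above, the following minimal Python sketch computes utilization as a fraction of the limit and flags quotas that cross a threshold. The quota names, sample numbers, and the 0.8 threshold are assumptions made for the example; they are not taken from this posting or from any GCP API response.

```python
from typing import Dict, List, Tuple


def quota_utilization(usage: float, limit: float) -> float:
    """Return usage as a fraction of the limit (1.0 means the quota is exhausted)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return usage / limit


def quotas_over_threshold(
    quotas: Dict[str, Tuple[float, float]], threshold: float = 0.8
) -> List[str]:
    """Return the names of quotas whose utilization is at or above the threshold.

    `quotas` maps a quota name to a (usage, limit) pair. The names and the
    default threshold are hypothetical values chosen for illustration.
    """
    return sorted(
        name
        for name, (usage, limit) in quotas.items()
        if quota_utilization(usage, limit) >= threshold
    )


# Hypothetical sample data: 72 of 96 CPUs used, 7 of 8 IP addresses used.
sample = {"CPUS": (72, 96), "IN_USE_ADDRESSES": (7, 8)}
print(quotas_over_threshold(sample))
```

In practice the usage and limit values would come from whichever monitoring source is in use; the sketch covers only the arithmetic.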

Regional / Zonal Monitoring

Resource Health by Zone/Region: Monitor for zone-specific outages.
Inter-Zone Latency: Communication delay between different zones.

Network (VPC, Interconnects)

Packet Loss Rate: Dropped packets during transmission.
Latency / Round-Trip Time (RTT): Network travel time.
Network Throughput: Data transfer rate (Bytes In/Out).
Firewall Rule Deny Count: Blocked connection attempts.

Compute & GKE (Infrastructure Layer)

Autoscaling Events:

Managed Instance Groups (MIGs): Number of VMs added/removed.
GKE Cluster Autoscaler: Node pool size changes.

Nodes / VMs:

CPU Utilization & Load.
Memory Utilization.
Disk Space Utilization & Disk I/O.

GKE / Application Layer

Pods / Containers:

Pod Kill Reasons: OOMKilled (Out of Memory), Evicted, etc.
Container Restarts.
CPU & Memory Usage vs. Requests/Limits.
Garbage Collection (GC): Pause duration and frequency (for JVM, Go, etc.).

Horizontal Pod Autoscaler (HPA):

Current vs. Desired Pod Replicas.

Middleware (Caching, Messaging)

Caching (e.g., Memorystore):

Cache Hit Ratio (Hits vs. Misses).
Latency & Active Connections.
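
For reference, the cache hit ratio listed above is conventionally hits divided by total lookups. The sketch below assumes the hit and miss counts are already available as integers; the function name and the choice to return 0.0 when there has been no traffic are illustrative assumptions.

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Return hits / (hits + misses), or 0.0 when no lookups were recorded."""
    total = hits + misses
    if total == 0:
        # Assumption: report 0.0 for an idle cache instead of raising an error.
        return 0.0
    return hits / total


# Hypothetical counters for one reporting interval.
print(cache_hit_ratio(900, 100))
```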

Messaging (e.g., Kafka, Pub/Sub):

Consumer Lag (critical).
Producer/Consumer Throughput.
Under-replicated Partitions (Kafka).
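
Consumer lag for an offset-based system such as Kafka is usually taken as the gap between the latest offset in each partition and the offset the consumer group has committed. The sketch below works on plain dictionaries and does not call any Kafka client; the function names and sample offsets are hypothetical, and treating a partition with no committed offset as starting from 0 is a simplifying assumption.

```python
from typing import Dict


def consumer_lag(
    end_offsets: Dict[int, int], committed_offsets: Dict[int, int]
) -> Dict[int, int]:
    """Return per-partition lag: latest offset minus committed offset.

    Partitions with no committed offset are treated as committed at 0
    (a simplifying assumption). Lag is floored at 0.
    """
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def total_lag(end_offsets: Dict[int, int], committed_offsets: Dict[int, int]) -> int:
    """Return the lag summed over all partitions."""
    return sum(consumer_lag(end_offsets, committed_offsets).values())


# Hypothetical offsets for a two-partition topic.
end = {0: 1500, 1: 1200}
committed = {0: 1450, 1: 1200}
print(consumer_lag(end, committed), total_lag(end, committed))
```

Pub/Sub does not expose partition offsets, so an equivalent signal there would be based on subscription backlog instead of this calculation.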

Managed Services (Load Balancer, Cloud SQL)

Cloud Load Balancing:

Request Count & Latency.
HTTP Error Codes (5xx, 4xx).

Cloud SQL (Databases):

CPU & Memory Utilization.
Active Connections & Replication Lag.

Why HCLTech?

At HCLTech, you'll supercharge your potential. You'll find your career. And you'll find your spark. All at a place that knows that helping its customers stay on top starts by putting its people first.

HCLTech is a global technology company, home to more than 226,300 people across 60 countries, delivering industry-leading capabilities centered around digital, engineering, cloud and AI, powered by a broad portfolio of technology services and products. We work with clients across all major verticals, providing industry solutions for Financial Services, Manufacturing, Life Sciences and Healthcare, Technology and Services, Telecom and Media, Retail and CPG, and Public Services. Consolidated revenues as of 12 months ending December 2025 totaled $14.5 billion.

Benefits

At HCLTech, we believe in empowering our employees with comprehensive benefits that support their professional growth and enhance their well-being. When you sign up for a career with us, you gain access to:

Industry-benchmarked compensation
Best-in-class healthcare benefits
Personal time off
Maternity and paternity benefits
Access to skills / higher education programs/resources
Discounts on products and services via Benefit Box
Participate in CSR programs and live life with a purpose
Opportunities to grow and advance your career

Note: The benefits listed above vary depending on the nature of your employment and the country where you work. Some benefits may be available in some countries but not in all.
