Senior Site Reliability Engineer Lead
Hyderabad, Telangana, India

Job Summary

Position: Senior Observability Engineer (GCP, K8s, Dynatrace)

Position Overview:

We are seeking an experienced Senior Observability Engineer to lead our monitoring strategy on Google Cloud Platform (GCP). This role is critical for ensuring the performance, reliability, and health of our entire cloud infrastructure. The ideal candidate will be an expert in implementing and managing Dynatrace in a GCP environment to build a world-class, end-to-end observability practice.

Key Responsibilities:

  • Architect and Implement Dynatrace on GCP: Lead the deployment, configuration, and lifecycle management of the Dynatrace platform within our GCP ecosystem, including performing installations and managing upgrades.
  • Establish End-to-End Observability: Drive the strategy to achieve comprehensive, end-to-end visibility across our applications and infrastructure using Dynatrace. This includes creating mirrored dashboards and key metrics within GCP's native Cloud Monitoring service for consolidated reporting.
  • Develop Executive Dashboards: Design and maintain high-level executive dashboards that provide a single-pane-of-glass view of overall infrastructure health, service-level objectives (SLOs), and key performance indicators (KPIs).

The detailed expectations for building GCP observability are listed below:

GCP Account & Quotas

  • Quota Limit Utilization: Usage vs. limits for critical resources (e.g., CPUs, IP addresses).

Regional / Zonal Monitoring

  • Resource Health by Zone/Region: Monitor for zone-specific outages.
  • Inter-Zone Latency: Communication delay between different zones.

Network (VPC, Interconnects)

  • Packet Loss Rate: Dropped packets during transmission.
  • Latency / Round-Trip Time (RTT): Network travel time.
  • Network Throughput: Data transfer rate (Bytes In/Out).
  • Firewall Rule Deny Count: Blocked connection attempts.

Compute & GKE (Infrastructure Layer)

  • Autoscaling Events:
    • Managed Instance Groups (MIGs): Number of VMs added/removed.
    • GKE Cluster Autoscaler: Node pool size changes.
  • Nodes / VMs:
    • CPU Utilization & Load.
    • Memory Utilization.
    • Disk Space Utilization & Disk I/O.

GKE / Application Layer

  • Pods / Containers:
    • Pod Kill Reasons: OOMKilled (Out of Memory), Evicted, etc.
    • Container Restarts.
    • CPU & Memory Usage vs. Requests/Limits.
    • Garbage Collection (GC): Pause duration and frequency (for JVM, Go, etc.).
  • Horizontal Pod Autoscaler (HPA):
    • Current vs. Desired Pod Replicas.

Middleware (Caching, Messaging)

  • Caching (e.g., Memorystore):
    • Cache Hit Ratio (Hits vs. Misses).
    • Latency & Active Connections.
  • Messaging (e.g., Kafka, Pub/Sub):
    • Consumer Lag (critical).
    • Producer/Consumer Throughput.
    • Under-replicated Partitions (Kafka).

Managed Services (Load Balancer, Cloud SQL)

  • Cloud Load Balancing:
    • Request Count & Latency.
    • HTTP Error Codes (5xx, 4xx).
  • Cloud SQL (Databases):
    • CPU & Memory Utilization.
    • Active Connections & Replication Lag.

Why HCLTech?

At HCLTech, you'll supercharge your potential. You'll find your career. And you'll find your spark. All at a place that knows that helping its customers stay on top starts by putting its people first.

HCLTech is a global technology company, home to more than 226,300 people across 60 countries, delivering industry-leading capabilities centered around digital, engineering, cloud and AI, powered by a broad portfolio of technology services and products. We work with clients across all major verticals, providing industry solutions for Financial Services, Manufacturing, Life Sciences and Healthcare, Technology and Services, Telecom and Media, Retail and CPG, and Public Services. Consolidated revenues for the 12 months ending December 2025 totaled $14.5 billion.

Benefits

At HCLTech, we believe in empowering our employees with comprehensive benefits that support their professional growth and enhance their well-being. When you sign up for a career with us, you gain access to:

  • Industry-benchmarked compensation
  • Best-in-class healthcare benefits
  • Personal time off
  • Maternity and paternity benefits
  • Access to skills / higher education programs/resources
  • Discounts on products and services via Benefit Box
  • Participate in CSR programs and live life with a purpose
  • Opportunities to grow and advance your career

Note: The benefits listed above vary depending on the nature of your employment and the country where you work. Some benefits may be available in some countries but not in all.