Senior Site Reliability Engineer Lead Job Details

Senior Site Reliability Engineer Lead

India

Job Description

Pune, Maharashtra

Job Summary

The Site Reliability Engineer (SRE) for the Axon / Kafka Platform is responsible for ensuring the reliability, availability, scalability, and operational excellence of Mastercard’s enterprise event streaming platform.

Axon is a fully managed, multi‑tenant Kafka‑based platform (Platform‑as‑a‑Service) that supports mission‑critical, high‑volume workloads across regions and environments. This role blends software engineering, distributed systems, and production operations, with a strong focus on incident management, observability, automation, and continuous reliability improvement.

The scope and impact of responsibilities increase with job level, from hands‑on execution to platform‑level ownership and technical leadership.

Key Responsibilities

Platform Reliability & Availability

Ensure high availability, fault tolerance, and performance of Kafka clusters and Axon platform services.
Operate and improve reliability mechanisms for brokers, partitions, replicas, Schema Registry, and replication services.
Define, track, and improve Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

Incident Management & Root Cause Analysis

Participate in on‑call rotations and provide hands‑on support during production incidents.
Lead or contribute to incident triage, mitigation, and recovery for Kafka and Axon‑related issues.
Perform root cause analysis (RCA) and drive corrective and preventive actions to closure.
Partner with application, infrastructure, and security teams during high‑severity incidents.

Monitoring, Alerting & Observability

Design, implement, and maintain monitoring, alerting, and dashboards for Kafka and Axon services.
Ensure incidents are detected proactively through alerts rather than through customer impact.
Continuously tune alerts to reduce noise and improve signal quality.

Change, Release & Operational Governance

Support production changes, maintenance activities, and platform upgrades (e.g., broker patching, certificate renewals, Schema Registry upgrades).
Review change requests (CRQs), deployment plans, and validation steps to ensure operational readiness.
Assess risk and ensure rollback and recovery plans are defined and tested.

Automation & Toil Reduction

Automate repetitive operational tasks, health checks, and validation workflows.
Improve operational efficiency through scripting, tooling, and platform enhancements.
Reduce manual intervention and improve mean time to recovery (MTTR).

Platform Enablement & Collaboration

Partner with application teams to support onboarding, scaling, and operational best practices.
Provide guidance on Kafka usage patterns, consumer group behavior, partitioning, and resiliency.
Create and maintain runbooks, SOPs, and operational documentation.
Share learnings through post‑incident reviews and knowledge‑sharing forums.

Skill Requirements

Administer Confluent Kafka clusters including installation, configuration, upgrades, and maintenance in Linux environments.

Implement and support Kafka security using SSL/TLS for encryption, SASL authentication, and ACLs for topic-level authorization.

Configure secure Kafka clients (producers, consumers, and connectors) with keystore and truststore management.

Monitor Kafka cluster health and performance to ensure high availability and minimal downtime.

Troubleshoot Kafka-related issues such as broker failures, consumer lag, authentication errors, and connector failures.

Support and manage Kafka Connect connectors, ensuring reliable data ingestion and delivery across systems.

Assist in broker scaling activities such as broker addition/removal and basic partition reassignment to balance cluster load.

Collaborate with application and infrastructure teams to integrate Kafka with enterprise systems and optimize streaming performance.

Perform topic lifecycle management including creation, deletion, partition increase, replication factor planning, and retention tuning.

Other Requirements

Relevant certifications in Site Reliability Engineering (SRE) or Cloud Services are a plus.

Why HCLTech?

At HCLTech, you'll supercharge your potential. You'll find your career. And you'll find your spark. All at a place that knows that helping its customers stay on top starts by putting its people first.

HCLTech is a global technology company, home to more than 226,300 people across 60 countries, delivering industry-leading capabilities centered around digital, engineering, cloud and AI, powered by a broad portfolio of technology services and products. We work with clients across all major verticals, providing industry solutions for Financial Services, Manufacturing, Life Sciences and Healthcare, Technology and Services, Telecom and Media, Retail and CPG, and Public Services. Consolidated revenues as of 12 months ending December 2025 totaled $14.5 billion.

Benefits

At HCLTech, we believe in empowering our employees with comprehensive benefits that support their professional growth and enhance their well-being. When you sign up for a career with us, you gain access to:

Industry-benchmarked compensation
Best-in-class healthcare benefits
Personal time off
Maternity and paternity benefits
Access to skills / higher education programs/resources
Discounts on products and services via Benefit Box
Participate in CSR programs and live life with a purpose
Opportunities to grow and advance your career

Note: The benefits listed above vary depending on the nature of your employment and the country where you work. Some benefits may be available in some countries but not in all.
