Job Summary
The Site Reliability Engineer (SRE) for the Axon / Kafka Platform is responsible for ensuring the reliability, availability, scalability, and operational excellence of Mastercard’s enterprise event streaming platform.
Axon is a fully managed, multi‑tenant Kafka‑based platform (Platform‑as‑a‑Service) that supports mission‑critical, high‑volume workloads across regions and environments. This role blends software engineering, distributed systems, and production operations, with a strong focus on incident management, observability, automation, and continuous reliability improvement.
The scope and impact of responsibilities increase with job level, from hands‑on execution to platform‑level ownership and technical leadership.
Key Responsibilities
Platform Reliability & Availability
- Ensure high availability, fault tolerance, and performance of Kafka clusters and Axon platform services.
- Operate and improve reliability mechanisms for brokers, partitions, replicas, Schema Registry, and replication services.
- Define, track, and improve Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
Incident Management & Root Cause Analysis
- Participate in on‑call rotations and provide hands‑on support during production incidents.
- Lead or contribute to incident triage, mitigation, and recovery for Kafka and Axon‑related issues.
- Perform root cause analysis (RCA) and drive corrective and preventive actions to closure.
- Partner with application, infrastructure, and security teams during high‑severity incidents.
Monitoring, Alerting & Observability
- Design, implement, and maintain monitoring, alerting, and dashboards for Kafka and Axon services.
- Ensure incidents are detected proactively through alerts rather than discovered through customer impact.
- Continuously tune alerts to reduce noise and improve signal quality.
Change, Release & Operational Governance
- Support production changes, maintenance activities, and platform upgrades (e.g., broker patching, certificate renewals, Schema Registry upgrades).
- Review change requests (CRQs), deployment plans, and validation steps to ensure operational readiness.
- Assess risk and ensure rollback and recovery plans are defined and tested.
Automation & Toil Reduction
- Automate repetitive operational tasks, health checks, and validation workflows.
- Improve operational efficiency through scripting, tooling, and platform enhancements.
- Reduce manual intervention and shorten mean time to recovery (MTTR).
Platform Enablement & Collaboration
- Partner with application teams to support onboarding, scaling, and operational best practices.
- Provide guidance on Kafka usage patterns, consumer group behavior, partitioning, and resiliency.
- Create and maintain runbooks, SOPs, and operational documentation.
- Share learnings through post‑incident reviews and knowledge‑sharing forums.
Skill Requirements
Administer Confluent Kafka clusters, including installation, configuration, upgrades, and maintenance in Linux environments.
Implement and support Kafka security using SSL/TLS for encryption, SASL for authentication, and ACLs for topic-level authorization.
Configure secure Kafka clients (producers, consumers, and connectors) with keystore and truststore management.
Monitor Kafka cluster health and performance to ensure high availability and minimal downtime.
Troubleshoot Kafka-related issues such as broker failures, consumer lag, authentication errors, and connector failures.
Support and manage Kafka Connect connectors, ensuring reliable data ingestion and delivery across systems.
Assist in broker scaling activities such as broker addition/removal and basic partition reassignment to balance cluster load.
Collaborate with application and infrastructure teams to integrate Kafka with enterprise systems and optimize streaming performance.
Perform topic lifecycle management, including creation, deletion, partition increases, replication factor planning, and retention tuning.